September 15, 2021

Howdy!

Outline

  • Poll: Getting to Know You
  • What is data scraping?
  • Why use R to scrape data?
  • Example 1: Scraping static websites with HTML/CSS.
  • Example 2: Scraping dynamic websites with API interception.

Getting to Know You

What is Data Scraping?

Scraping data means transforming human-readable data into machine-readable data.

We scrape data when:

  • The data is available online; and,
  • We want to work with it; but,
  • It’s in an inconvenient format.

We could also think of it as the creative acquisition of machine-readable data.

Why Use R for Data Scraping?

  1. A robust toolkit.
    • There are many great R packages that make scraping relatively easy and painless.
  2. Code-based power, repeatability, and flexibility.
    • A code-based approach gives you precise control and lets you re-run or re-purpose analyses as needed.
  3. A single environment.
    • Scrape your data and then analyse it from the comfort of a single RStudio session.

Use Case: Mapping Ottawa’s Food Environment

  • Research suggests that access to healthy food is an important factor in public health.
  • To better understand any inequities and access issues in Ottawa, the Ottawa Neighbourhood Study is taking an inventory of Ottawa’s food environment.
  • This means finding every grocery store, restaurant, convenience store, bakery…

This is a perfect use-case for data scraping.

Case 1: Static HTML/CSS

A static website is a collection of html (and other) files that you download from a server and view in your browser.

  • It’s the files themselves that are static: everyone who visits gets the same files.
  • But static websites can still be interactive!

Case 1: Static HTML/CSS: The Basic Recipe

Because the data is all contained in static files, scraping a static website generally follows this recipe:

  1. Download the site’s html in R with rvest::read_html().
  2. Find the CSS selectors you need using your browser and SelectorGadget.
  3. Find any html attributes you need by viewing the source html in your browser.
  4. Extract the data in R using rvest::html_elements() and rvest::html_attrs().

In practice, of course, the steps don’t usually go in this nice order :)

Case 1: Static HTML/CSS: Foodland

To find all Foodland locations, we first use our browser to find the url for Foodland’s store-locator page, then use SelectorGadget to find the CSS selector for each store’s information.

Then we can read the site in R and get the store data:

# read website's html
html <- rvest::read_html("https://foodland.ca/store-locator/")
# separate out the sections for each store
stores <- rvest::html_elements(html, css = ".brand-foodland-store-location")
# isolate the first store for testing
store <- stores[[1]]
store
## {html_node}
## <div class="store-result brand-foodland-store-location" data-js-store-result="" data-id="83832" data-lng="-65.8315" data-lat="46.7291" data-city="blackville" data-province="nb" data-postal-code="e9b 1n3" data-taxonomy-services="" data-taxonomy-types="" data-hours="{"monday":"8:00 a.m. to 8:00 p.m.","tuesday":"8:00 a.m. to 8:00 p.m.","wednesday":"8:00 a.m. to 8:00 p.m.","thursday":"8:00 a.m. to 8:00 p.m.","friday":"8:00 a.m. to 8:00 p.m.","saturday":"8:00 a.m. to 8:00 p.m.","sunday":"10:00 a.m. to 6:00 p.m."}" data-brand="foodland-store-location">
## [1] <div class="equal_height">\n\t\t\t\t\t\t\t<div>\n\t\t\t\t\t\t\t\t<h4><a a ...

Case 1: Static HTML/CSS: Scraping Attributes

By inspecting the raw html in our browser with view-source:, we find that some data is stored as invisible attributes. We can extract them with rvest::html_attr():

# extract lat/lon coords using html attributes
lat <- rvest::html_attr(store, "data-lat")
lon <- rvest::html_attr(store, "data-lng")
# print to console
c(lat, lon)
## [1] "46.7291"  "-65.8315"
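The recipe above also mentions rvest::html_attrs(); the plural version returns every attribute of an element at once as a named character vector, which can be handy when you’re still exploring. A quick sketch, continuing from the store object above:

```r
# pull every attribute of the store's <div> as a named character vector
attrs <- rvest::html_attrs(store)

# the coordinates and city are then just named entries
attrs[c("data-lat", "data-lng", "data-city")]
```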

Case 1: Static HTML/CSS: Scraping Text

Some data is presented only as human-readable text, so we can extract it using CSS selectors that we find again with SelectorGadget.

We use html_elements() to get the html snippets for each item, then html_text() to get the text.

Here we get the city for the first store:

city <- rvest::html_elements(store, css = ".city")
rvest::html_text(city)
## [1] "Blackville"

Case 1: Static HTML/CSS: Finishing Up

  • We’ve seen how to get some clean data for one store.
  • The next step is to write code to get all the data for one store.
  • Then, to extract information for all stores, you would iterate over each store:
    • Either using a for loop; or,
    • Using a vectorized approach with purrr::map() or lapply().

For a complete worked example, see the first example workbook for this talk.
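As a sketch of that iteration step: assuming we wrap the per-store extraction in a hypothetical helper called parse_store() (the field names below are just the ones we scraped earlier), the vectorized approach might look like this:

```r
# hypothetical helper: extract one store's fields into a one-row tibble
parse_store <- function(store) {
  tibble::tibble(
    city = rvest::html_text(rvest::html_elements(store, css = ".city")),
    lat  = rvest::html_attr(store, "data-lat"),
    lon  = rvest::html_attr(store, "data-lng")
  )
}

# apply it to every store and row-bind the results into one data frame
all_stores <- purrr::map_dfr(stores, parse_store)
```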

Case 1: Foodland’s Locations

Case 1: Static HTML/CSS: Tables and Forms

If a website has a table of values, you’re in luck:

  • rvest::html_table() automatically puts tabular data into a structured data frame.

If a website has a form you need to fill and submit, you’re not out of luck:

  • The functions rvest::html_form(), rvest::html_form_set(), and rvest::html_form_submit() can help you fill in and submit web forms automatically and read the responses.
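As an illustrative sketch of both ideas (the url and the form-field name "query" here are hypothetical, not taken from the Foodland example):

```r
# tables: html_table() returns a list of data frames, one per <table>
html <- rvest::read_html("https://example.com/stores")
tables <- rvest::html_table(html)

# forms: find a form, fill it in, submit it, and read the response
form   <- rvest::html_form(html)[[1]]
filled <- rvest::html_form_set(form, query = "grocery")
resp   <- rvest::html_form_submit(filled)
result <- rvest::read_html(resp)
```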

Any Questions So Far?

Case 2: API Interception: The General Idea

  • Dynamic websites, for our purposes, are not just static files that your browser displays.
    • They are more like computer programs that run in your browser.
  • As you use a dynamic site, your browser is actively downloading new data in response to your actions.
  • The data comes from API calls.
  • By watching what your browser does, you can find those API calls and make them yourself using R.

Case 2: API Interception: The Recipe

  1. Use the Chrome DevTools console to monitor the browser’s network activity and find the relevant API call(s).
  2. Reverse-engineer the API calls to find out how to ask for the data you want.
  3. Make the API calls and store the results, using loops or other techniques as appropriate.
  4. Tidy the results.

Case 2: API Interception: Circle K

  • We want to find all the Circle K convenience stores in Ottawa.
  • We try Circle K’s store locator page.
  • But when we view-source: in our browser, the data isn’t there!
  • So instead we use the developer console to monitor network traffic…
    • And we find an API call that returns the store data!
  • And we try the API url in our browser…
    • And it gives us the same response!

We’re ready to pull this data directly into R.

Case 2: API Interception: Calling the API

# query the extremely long url for the API call
url <- "https://www.circlek.com/stores_new.php?lat=45.421&lng=-75.69&services=&region=global&page=0"

resp <- httr::GET(url)

# extract the content from the response, and parse the JSON result
stores <- httr::content(resp, type = "text/json", encoding = "UTF-8") %>%
  jsonlite::fromJSON()

# inspect the structure of the response
str(stores, max.level = 1)
## List of 5
##  $ count      : int 10
##  $ page       : int 1
##  $ division   : chr "ontario"
##  $ stores     :List of 10
##  $ tactic_urls:List of 12

Case 2: API Interception: Parsing the Response

Parsing complex lists can be a pain, but Circle K’s response is easy to tidy:

# convert the response to a nested data frame, and then unnest the data
stores$stores %>%
  tibble::enframe() %>%
  tidyr::unnest_wider(value) %>%
  dplyr::select(display_brand, address, city, latitude, longitude) %>%
  head(5)
## # A tibble: 5 x 5
##   display_brand address                   city   latitude   longitude  
##   <chr>         <chr>                     <chr>  <chr>      <chr>      
## 1 Mac's         "11-160 Elgin St., "      OTTAWA 45.4197298 -75.6928678
## 2 Circle K      "388 Elgin Street"        OTTAWA 45.4144881 -75.6876387
## 3 Circle K      "120 Osgoode St."         OTTAWA 45.4237455 -75.6809046
## 4 Circle K      "210 Laurier Avenue East" OTTAWA 45.4256518 -75.6817806
## 5 Circle K      "333 Rideau Street"       OTTAWA 45.4296923 -75.6843574

Case 2: API Interception: Deconstructing the API

For this API, request parameters are sent in the url after ? and separated by &.

https://www.circlek.com/stores_new.php?lat=45.421&lng=-75.69&services=&region=global&page=0

So the parameters here are:

  • lat=45.421, lng=-75.69: The geographic coordinates for the search.
  • services=: Blank in this request; maybe to look for specific services?
  • region=global: Might let you limit searches; not of interest to us.
  • page=0: Aha! This tells the API which page of results to return!
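Rather than pasting those parameters together by hand, you could build the same request with httr’s query support; a sketch using httr::modify_url():

```r
# build the API url from its parts instead of hard-coding the query string
url <- httr::modify_url(
  "https://www.circlek.com/stores_new.php",
  query = list(
    lat = 45.421, lng = -75.69,
    services = "", region = "global",
    page = 0
  )
)
```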

Case 2: API Interception: Getting the Data

  • So to get the data automatically, we can call the API with different values for page.
  • Here’s a simple example using a for loop:
# set up an empty tibble for our results
results <- tibble::tibble()

# assume num_pages holds the number of result pages to fetch
for (page in 0:num_pages){
  # assume the base url is in a variable called base_url
  url <- paste0(base_url, page)
  # call a function to call the API and parse the results
  result <- call_circlek_api(url)
  # add the result to our big results table
  results <- dplyr::bind_rows(results, result)
}
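The loop assumes a helper called call_circlek_api(), which isn’t defined above; one possible sketch, reusing the request and parsing steps from the earlier slides (and assuming the magrittr pipe %>% is loaded):

```r
# hypothetical helper: call the API at `url` and return a tidy tibble
call_circlek_api <- function(url) {
  resp <- httr::GET(url)
  parsed <- httr::content(resp, type = "text/json", encoding = "UTF-8") %>%
    jsonlite::fromJSON()
  # flatten the list of stores into one row per store
  parsed$stores %>%
    tibble::enframe() %>%
    tidyr::unnest_wider(value)
}
```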

Case 2: Circle K’s Global Empire

After some (off-screen) data collection, we can plot a heatmap of 9,600 global Circle K locations:

Closing Considerations: Etiquette

In closing, a few suggestions for web-scraping etiquette:

  • Please don’t overwhelm web servers: space out requests using Sys.sleep().
    • Some sites will block your IP if you make too many requests too quickly.
  • Please don’t scrape more than you really need.
    • Traffic costs add up.
  • Scraping password-protected data might not be a good idea.
    • Yes, you might have access, but what are the terms of use?
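For example, a polite version of the paging loop from the Circle K example might pause briefly between requests (base_url, num_pages, and call_circlek_api() are the same assumed names as before):

```r
results <- tibble::tibble()

for (page in 0:num_pages) {
  url <- paste0(base_url, page)
  result <- call_circlek_api(url)
  results <- dplyr::bind_rows(results, result)
  # wait a couple of seconds so we don't hammer the server
  Sys.sleep(2)
}
```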

Thanks! Questions?

Annex A: Essential Tools

  • R Packages:
    • httr: For low-level interaction with websites using functions like httr::GET() and httr::POST(), and for interpreting their responses with httr::content().
    • rvest: An extremely well-supported package devoted to “harvesting” web data.
    • jsonlite: For parsing API responses in JSON format.
    • RSelenium: For automated browser-based scraping.
  • Browser-Based Tools:
    • SelectorGadget: A point-and-click Chrome plug-in for finding CSS selectors.
    • Chrome DevTools: A Chrome tool (Ctrl-Shift-J on Windows) we’ll use for monitoring network traffic and API calls.
    • Chrome view-source: An easy way to see a website’s underlying html.